The steps involved in extracting usable features are:

  1. Retrieving Gene Ontology annotations for each Entrez ID
  2. Comparing matches for all possible protein pairs

Retrieving Gene Ontology annotations for each Entrez ID

To do this we must:

  1. Map the Entrez IDs onto corresponding Gene Ontology IDs
  2. Use goatools to retrieve GO terms from gene_ontology.1_2.obo
  3. Add these to a dictionary keyed by Entrez ID, with a nested dictionary of terms for each

Alternatively, it looks like these terms are already in the gene2go file retrieved from the NCBI FTP site, meaning that we wouldn't need to use goatools:


In [1]:
cd ../../geneconversion/


/home/gavin/Documents/MRes/geneconversion

In [2]:
!head gene2go


#Format: tax_id GeneID GO_ID Evidence Qualifier GO_term PubMed Category (tab is used as a separator, pound sign - start of a comment)
3702	814629	GO:0005634	ISM	-	nucleus	-	Component
3702	814629	GO:0008150	ND	-	biological_process	-	Process
3702	814630	GO:0003700	ISS	-	sequence-specific DNA binding transcription factor activity	11118137	Function
3702	814630	GO:0005634	ISM	-	nucleus	-	Component
3702	814630	GO:0006355	TAS	-	regulation of transcription, DNA-templated	11118137	Process
3702	814630	GO:0006499	RCA	-	N-terminal protein myristoylation	-	Process
3702	814630	GO:0006635	RCA	-	fatty acid beta-oxidation	-	Process
3702	814630	GO:0006891	RCA	-	intra-Golgi vesicle-mediated transport	-	Process
3702	814630	GO:0016558	RCA	-	protein import into peroxisome matrix	-	Process
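
If we did want to skip goatools, a minimal sketch of pulling the terms straight out of gene2go, assuming the tab-separated column layout shown above holds throughout the file:

import csv

c = csv.reader(open("gene2go"), delimiter="\t")
#skip the comment/header line
c.next()
direct = {}
for line in c:
    #map Entrez ID -> {GO ID: GO term} without touching the .obo file;
    #'direct' is a hypothetical name, not used elsewhere in this notebook
    direct.setdefault(line[1], {})[line[2]] = line[5]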

However, this isn't certain, and goatools seems to produce fairly good results, so we use it for now:


In [3]:
import csv

In [28]:
from pylab import *

In [4]:
c = csv.reader(open("gene2go"), delimiter="\t")
#skip first line
c.next()
#initialise dictionary
goEntrezdict = {}
for line in c:
    #on every line use the Entrez ID as a key and initialise a dictionary
    goEntrezdict[line[1]] = {}

Then, iterating through the file again, we can add entries to this dictionary; each entry is itself a dictionary of empty lists indexed by GO ID:


In [5]:
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
for line in c:
    goEntrezdict[line[1]][line[2]] = []

Retrieving GO terms with goatools

Next we use goatools, along with the Gene Ontology flat database file gene_ontology.1_2.obo. We iterate through the Entrez IDs we have found, and then over the GO IDs to which they refer:


In [6]:
cd ../geneontology/


/home/gavin/Documents/MRes/geneontology

In [7]:
import goatools

In [8]:
parsedobo = goatools.obo_parser.GODag('gene_ontology.1_2.obo')


load obo file gene_ontology.1_2.obo
42995 nodes imported

In [9]:
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        goEntrezdict[EntrezID][goID] = parsedobo[goID].name

Looking at an example entry:


In [10]:
print goEntrezdict[goEntrezdict.keys()[0]]


{'GO:0016021': 'integral component of membrane'}

Looking at crossover

We are looking for annotations that cross over between pairs of proteins. To do this we need to pick a set of annotations to test each pair of Entrez IDs for. These should be annotations that are likely to occur in pairs; any annotation that occurs in isolation in our set will be useless.

So we have to count the occurrences of every term in our dictionary of dictionaries. Start by flattening it into a list:


In [11]:
gotermlist = []
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        gotermlist.append(goEntrezdict[EntrezID][goID])

Then count duplicates in the list using a dictionary:


In [12]:
dupdict = {}
#initialise counters
for k in gotermlist:
    dupdict[k] = 0
#count
for k in gotermlist:
    dupdict[k] += 1
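
As an aside, the two passes above can be collapsed into one with the standard library's collections.Counter (available since Python 2.7); a minimal equivalent sketch, written to a new name to avoid clobbering dupdict:

from collections import Counter

#Counter tallies the duplicates in a single pass
counts = Counter(gotermlist)
#it also provides the top-N query used below directly:
#counts.most_common(10)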

In [13]:
import heapq

In [14]:
#find the terms that occur most often
print "Terms that occurred most often:"
for k in heapq.nlargest(10, dupdict, key=lambda x: dupdict[x]):
    print "\t"+"%s occurred %i times"%(k,dupdict[k])


Terms that occurred most often:
	molecular_function occurred 52423 times
	biological_process occurred 49804 times
	cellular_component occurred 38992 times
	nucleus occurred 38516 times
	cytoplasm occurred 28668 times
	integral component of membrane occurred 25343 times
	membrane occurred 21333 times
	plasma membrane occurred 18145 times
	protein binding occurred 15731 times
	mitochondrion occurred 12599 times

The three most common "terms" are not really terms at all; they are the names of the three domains. There must be a problem with goatools, or a problem with the database file. The rest of the most common terms are all localisations.
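
These domain names are the root nodes of the ontology (GO terms whose name is identical to their namespace), so they are unlikely to be informative as features. If we wanted to exclude them before counting, a minimal sketch:

#the three root nodes are named after their namespaces;
#dropping them leaves only genuine annotation terms
roots = set(['molecular_function', 'biological_process', 'cellular_component'])
filteredterms = [t for t in gotermlist if t not in roots]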

Splitting by domain

There are three domains in the Gene Ontology database, and we can quite easily split the counts above by domain. This was also done by Qi when using the Gene Ontology as features.

Repeating the above for each domain:


In [15]:
cd ../geneconversion/


/home/gavin/Documents/MRes/geneconversion

In [16]:
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
for line in c:
    goEntrezdict[line[1]][line[2]] = {}

In [17]:
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        goEntrezdict[EntrezID][goID][parsedobo[goID].namespace] = parsedobo[goID].name

In [18]:
print goEntrezdict[goEntrezdict.keys()[4]]


{'GO:0005575': {'cellular_component': 'cellular_component'}, 'GO:0045596': {'biological_process': 'negative regulation of cell differentiation'}, 'GO:0000332': {'molecular_function': 'template for synthesis of G-rich strand of telomere DNA activity'}, 'GO:0007004': {'biological_process': 'telomere maintenance via telomerase'}}

In [19]:
gotermdict = {}
#initialise dictionaries
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        for domain in goEntrezdict[EntrezID][goID].keys():
            gotermdict[domain] = {}
#initialise counters in dictionaries
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        for domain in goEntrezdict[EntrezID][goID].keys():
            term = goEntrezdict[EntrezID][goID][domain]
            gotermdict[domain][term] = 0
#count
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        for domain in goEntrezdict[EntrezID][goID].keys():
            term = goEntrezdict[EntrezID][goID][domain]
            gotermdict[domain][term] += 1
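
As an aside, the three passes above could be collapsed into a single pass with collections.defaultdict; a minimal equivalent sketch, written to a new name, counts, to avoid clobbering gotermdict:

from collections import defaultdict

#missing domains get a fresh counter dict, missing terms a zero count
counts = defaultdict(lambda: defaultdict(int))
for EntrezID in goEntrezdict:
    for goID in goEntrezdict[EntrezID]:
        for domain, term in goEntrezdict[EntrezID][goID].items():
            counts[domain][term] += 1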

In [40]:
for k in gotermdict.keys():
    print "Domain %s:"%k
    for ke in heapq.nlargest(10, gotermdict[k], key=lambda x: gotermdict[k][x]):
        print "\t"+"%s occurred %i times"%(ke,gotermdict[k][ke])


Domain molecular_function:
	molecular_function occurred 52423 times
	protein binding occurred 15731 times
	metal ion binding occurred 12410 times
	ATP binding occurred 10840 times
	DNA binding occurred 10210 times
	zinc ion binding occurred 8135 times
	nucleotide binding occurred 7783 times
	sequence-specific DNA binding transcription factor activity occurred 7446 times
	hydrolase activity occurred 5134 times
	G-protein coupled receptor activity occurred 4838 times
Domain cellular_component:
	cellular_component occurred 38992 times
	nucleus occurred 38516 times
	cytoplasm occurred 28668 times
	integral component of membrane occurred 25343 times
	membrane occurred 21333 times
	plasma membrane occurred 18145 times
	mitochondrion occurred 12599 times
	cytosol occurred 11760 times
	extracellular region occurred 8696 times
	extracellular vesicular exosome occurred 6349 times
Domain biological_process:
	biological_process occurred 49804 times
	regulation of transcription, DNA-templated occurred 10962 times
	transcription, DNA-templated occurred 7165 times
	metabolic process occurred 7122 times
	transport occurred 6638 times
	signal transduction occurred 6188 times
	G-protein coupled receptor signaling pathway occurred 6024 times
	oxidation-reduction process occurred 5158 times
	proteolysis occurred 4574 times
	translation occurred 4021 times

Bait and Prey

The question is, which terms should we pick to be effective features? Obviously we want ones with a lot of crossover, so that the features will actually take non-zero values. But we also want to make sure they'll be useful on the proteins we're actually going to use the classifier on at some point. The terms with the largest crossover on the bait and prey proteins from the pulldown experiments, taken from all three domains, would probably be the best choice.

Retrieving the bait and prey protein lists and building this list of terms to match:


In [26]:
cd ../forGAVIN/pulldown_data/BAITS/


/home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS

In [29]:
baitids = list(flatten(csv.reader(open("baits_entrez_ids.csv"), delimiter="\t")))

In [30]:
cd ../PREYS/


/home/gavin/Documents/MRes/forGAVIN/pulldown_data/PREYS

In [31]:
preyids = list(flatten(csv.reader(open("prey_entrez_ids.csv"), delimiter="\t")))

In [32]:
cd ../../../geneconversion/


/home/gavin/Documents/MRes/geneconversion

Initialising a new dictionary for just these Entrez IDs:


In [33]:
baitpreyEntrezdict = {}
#initialise dictionaries in this dictionary
for i in baitids+preyids:
    baitpreyEntrezdict[i] = {}
#iterate through gene2go file
c = csv.reader(open("gene2go"), delimiter="\t")
#skip first line
c.next()
for line in c:
    try:
        baitpreyEntrezdict[line[1]][line[2]] = {}
    except KeyError:
        #ignore Entrez IDs that don't match
        pass
#use goatools to retrieve domains and terms using IDs
#building dictionary with domains as keys this time
for entrezID in baitpreyEntrezdict.keys():
    goIDs = baitpreyEntrezdict[entrezID].keys()
    #reinitialise the dictionary
    baitpreyEntrezdict[entrezID] = {}
    for goID in goIDs:
        baitpreyEntrezdict[entrezID][parsedobo[goID].namespace] = []
    for goID in goIDs:
        baitpreyEntrezdict[entrezID][parsedobo[goID].namespace].append(parsedobo[goID].name)

In [34]:
print baitpreyEntrezdict[baitpreyEntrezdict.keys()[0]]


{'molecular_function': ['phosphoprotein binding'], 'cellular_component': ['growth cone', 'plasma membrane'], 'biological_process': ['neuron projection development']}

In [35]:
domains = ['molecular_function','cellular_component','biological_process']

In [36]:
#iterate over domains and count the occurrence of terms:
gotermdict = {}
for domain in domains:
    gotermdict[domain] = {}
#initialise counters
for entrezID in baitpreyEntrezdict.keys():
    for domain in baitpreyEntrezdict[entrezID].keys():
        for term in baitpreyEntrezdict[entrezID][domain]:
            gotermdict[domain][term] = 0
#count
for entrezID in baitpreyEntrezdict.keys():
    for domain in baitpreyEntrezdict[entrezID].keys():
        for term in baitpreyEntrezdict[entrezID][domain]:
            gotermdict[domain][term] += 1

In [37]:
for k in gotermdict.keys():
    print "Domain %s:"%k
    for ke in heapq.nlargest(10, gotermdict[k], key=lambda x: gotermdict[k][x]):
        print "\t"+"%s occurred %i times"%(ke,gotermdict[k][ke])


Domain molecular_function:
	protein binding occurred 960 times
	poly(A) RNA binding occurred 308 times
	ATP binding occurred 232 times
	metal ion binding occurred 137 times
	identical protein binding occurred 100 times
	RNA binding occurred 99 times
	protein homodimerization activity occurred 97 times
	calcium ion binding occurred 88 times
	zinc ion binding occurred 79 times
	GTP binding occurred 78 times
Domain cellular_component:
	cytoplasm occurred 614 times
	cytosol occurred 575 times
	nucleus occurred 554 times
	extracellular vesicular exosome occurred 550 times
	plasma membrane occurred 449 times
	mitochondrion occurred 265 times
	integral component of membrane occurred 252 times
	nucleolus occurred 235 times
	nucleoplasm occurred 143 times
	membrane occurred 130 times
Domain biological_process:
	small molecule metabolic process occurred 266 times
	gene expression occurred 184 times
	viral process occurred 144 times
	signal transduction occurred 130 times
	cellular protein metabolic process occurred 125 times
	synaptic transmission occurred 114 times
	RNA metabolic process occurred 104 times
	mRNA metabolic process occurred 99 times
	transmembrane transport occurred 98 times
	blood coagulation occurred 97 times

Saving the 30 most common terms from each of these domains to files in the geneontology directory:


In [38]:
cd ../geneontology/


/home/gavin/Documents/MRes/geneontology

In [41]:
!git annex unlock molecular_function.baitprey.term
!git annex unlock biological_process.baitprey.term
!git annex unlock cellular_component.baitprey.term


unlock biological_process.baitprey.term (copying...) ok
unlock cellular_component.baitprey.term (copying...) ok

In [42]:
for k in gotermdict.keys():
    f = open("%s.baitprey.term"%k, "w")
    for ke in heapq.nlargest(30, gotermdict[k], key=lambda x: gotermdict[k][x]):
        f.write("%s,%i"%(ke,gotermdict[k][ke])+"\n")
    f.close()

In [43]:
ls


biological_process.baitprey.term  gene_ontology.1_2.obo@   GOfeatures.class.test@
cellular_component.baitprey.term  geneontology.terms.txt@  molecular_function.baitprey.term
gene_association.goa_human@       goa.README@              testgen.pickle@

In [44]:
!head biological_process.baitprey.term


small molecule metabolic process,266
gene expression,184
viral process,144
signal transduction,130
cellular protein metabolic process,125
synaptic transmission,114
RNA metabolic process,104
mRNA metabolic process,99
transmembrane transport,98
blood coagulation,97

Building feature table and database

We intend to produce a file of the following form, which can be used in conjunction with the ocbio.extract module:

Protein 1    Protein 2    GO molecular function feature 1    ...    GO biological process feature 90
1234         3214         0                                  ...    1
...          ...          ...                                ...    ...

A 1 is placed in an entry if both proteins share that Gene Ontology term; otherwise it is zero. To build this file we must iterate through all combinations of the Entrez identifiers in the gene2go file:
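
As a toy illustration first: the shared-term test is an elementwise AND of the two proteins' binary term vectors, which is exactly what the elementwise multiplication used in the loops below implements:

from numpy import array

#toy three-term vectors for a hypothetical pair of proteins
veca = array([1, 0, 1])
vecb = array([1, 1, 0])
#elementwise product is 1 only where both vectors are 1
print list(veca * vecb)
#prints [1, 0, 0]: only the first term is shared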


In [54]:
cd ../geneconversion/


/home/gavin/Documents/MRes/geneconversion

In [55]:
# read through the gene2go file and build the set of human Entrez IDs
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
#take second column and make it a set
EntrezIDs = []
for line in c:
    #only human IDs
    if line[0] == "9606":
        EntrezIDs.append(line[1])
EntrezIDs = set(EntrezIDs)

In [56]:
#build a dictionary as above for this list of Entrez IDs
goEntrezdict = {}
for i in EntrezIDs:
    goEntrezdict[i] = {}
#iterate through gene2go file
c = csv.reader(open("gene2go"), delimiter="\t")
#skip first line
c.next()
for line in c:
    try:
        goEntrezdict[line[1]][line[2]] = {}
    except KeyError:
        #ignore Entrez IDs that don't match
        pass
#use goatools to retrieve domains and terms using IDs
#building dictionary with domains as keys this time
for entrezID in goEntrezdict.keys():
    goIDs = goEntrezdict[entrezID].keys()
    #reinitialise the dictionary
    goEntrezdict[entrezID] = {}
    for goID in goIDs:
        goEntrezdict[entrezID][parsedobo[goID].namespace] = []
    for goID in goIDs:
        goEntrezdict[entrezID][parsedobo[goID].namespace].append(parsedobo[goID].name)

In [57]:
#get the lists of terms we're trying to match to:
terms = {}
for k in gotermdict.keys():
    terms[k] = []
    for ke in heapq.nlargest(30, gotermdict[k], key=lambda x: gotermdict[k][x]):
        terms[k].append(ke)

In [13]:
import itertools

Moving to an external drive to avoid filling up my main hard drive:


In [1]:
cd /mnt/external/remotes/


/mnt/external/remotes

In [55]:
mkdir geneontology

In [2]:
cd geneontology/


/mnt/external/remotes/geneontology

In [57]:
#write header to file
c = csv.writer(open("geneontology.flat.pairs.txt", "w"),delimiter="\t")
line = ["Protein 1", "Protein 2"]
for k in terms.keys():
    for term in terms[k]:
        line.append("%s:%s"%(k,term))
c.writerow(line)

In [40]:
import scipy.misc

In [ ]:
#how long will this take?
npairs = scipy.misc.comb(len(EntrezIDs),2)
print "Number of lines to write to file %i"%npairs
lcount = 0
#first build vectors for each Entrez ID:
govectordict= {}
for entrezID in EntrezIDs:
    vec = []
    for domain in domains:
        for term in terms[domain]:
            try:
                if term in goEntrezdict[entrezID][domain]:
                    vec.append(1)
                else:
                    vec.append(0)
            except KeyError:
                vec.append(0)
    #save to dictionary
    govectordict[entrezID] = array(vec[:])
#iterate over combinations of Entrez pairs
for pair in itertools.combinations(EntrezIDs,2):
    #building the line as it goes
    line = list(pair) + list(govectordict[pair[0]]*govectordict[pair[1]])
    if lcount%1000000 == 0:
        print "%.4f"%(100.0*lcount/npairs)+"% lines written"
    lcount += 1
    #write this new line to file
    c.writerow(line)


Number of lines to write to file 165920436
0.0000% lines written

Left the above script to run overnight; it produced a 14GB file, which is too large to count with wc -l in a reasonable time. Instead, used [this script][qwc] to find the approximate number of lines. This should be relatively accurate in this case as the structure of the file is quite regular:
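
The usual trick behind such approximate counts, and presumably what that script does, is to measure the average line length over a small sample and divide the total file size by it; a minimal Python sketch of the idea:

import os

def approx_linecount(fname, nsample=1000):
    #average the length of the first nsample lines,
    #then divide the total file size by that average
    f = open(fname)
    lengths = []
    for i, line in enumerate(f):
        if i >= nsample:
            break
        lengths.append(len(line))
    f.close()
    avg = float(sum(lengths)) / len(lengths)
    return int(os.path.getsize(fname) / avg)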


In [3]:
!~/qwc.sh geneontology.flat.pairs.txt


73501163 geneontology.flat.pairs.txt

Which is a little less than halfway to the full line count shown above.

Alternatively, we could avoid using the flat file altogether and save it to a database immediately. Attempting to write such a database to see how long that will take:


In [4]:
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")

In [5]:
import ocbio.extract

In [58]:
#how long will this take?
npairs = scipy.misc.comb(len(EntrezIDs),2)
print "Number of lines to write to file %i"%npairs
lcount = 0
#first build vectors for each Entrez ID:
govectordict= {}
for entrezID in EntrezIDs:
    vec = []
    for domain in domains:
        for term in terms[domain]:
            try:
                if term in goEntrezdict[entrezID][domain]:
                    vec.append(1)
                else:
                    vec.append(0)
            except KeyError:
                vec.append(0)
    #save to dictionary
    govectordict[entrezID] = array(vec[:])
#open database    
db = ocbio.extract.openpairshelf("geneontology.pairs.db")
#iterate over combinations of Entrez pairs
for pair in itertools.combinations(EntrezIDs,2):
    #building the line as it goes
    line = list(govectordict[pair[0]]*govectordict[pair[1]])
    if lcount%100000 == 0:
        print "%.4f"%(100.0*lcount/npairs)+"% lines written"
    lcount += 1
    #write this new line to database
    db[frozenset(pair)] = line
db.close()


Number of lines to write to file 165920436
0.0000% lines written
0.0603% lines written
0.1205% lines written
0.1808% lines written
0.2411% lines written
0.3013% lines written
0.3616% lines written
0.4219% lines written
0.4822% lines written
0.5424% lines written
0.6027% lines written
0.6630% lines written
0.7232% lines written
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-58-560a737219f0> in <module>()
     28     lcount += 1
     29     #write this new line to database
---> 30     db[frozenset(pair)] = line
     31 db.close()

/home/gavin/Documents/MRes/opencast-bio/ocbio/extract.py in __setitem__(self, key, value)
    171         else:
    172             key = key[0] + "\t" + key[1]
--> 173         shelve.DbfilenameShelf.__setitem__(self, key, value)
    174         return None
    175 

/usr/lib/python2.7/shelve.pyc in __setitem__(self, key, value)
    130         f = StringIO()
    131         p = Pickler(f, self._protocol)
--> 132         p.dump(value)
    133         self.dict[key] = f.getvalue()
    134 

KeyboardInterrupt: 

It seems like the efficient way to do this would be to use a generator, and scikit-learn would probably accept one. Interestingly, since shelve and pickle can store arbitrary Python objects, there should be nothing stopping me from storing the generator as an object in a pickle file.

Then the final file we want to generate would itself be a generator which produces feature vectors corresponding to a list of protein pairs. This could be added as an option to the ocbio.extract.ProteinParser class, though some adjustment to the system would be needed. The parsers themselves could be stored and then used to index protein pairs.

Using a Generator

To build a generator function from the above code we just need to supply the variables used when writing the file above:


In [78]:
def genGOfvector(pairs,goEntrezdict,terms):
    '''Generator function for creating Gene Ontology feature vectors.
    Takes as input arguments:
        * list of pairs (tuples) of protein Entrez IDs to map
        * dictionary mapping Entrez IDs to dictionaries of GO terms keyed by domain
        * dictionary of the terms we're interested in, keyed by domain
    Yields the corresponding feature vectors in sequence'''
    # First define the domains, I'm guessing they're not changing
    domains = ['molecular_function','cellular_component','biological_process']
    # get a flattened set of Entrez IDs:
    EntrezIDs = set(flatten(map(list,pairs)))
    # initialise the vectors we're going to be using:
    govectordict= {}
    for entrezID in EntrezIDs:
        vec = []
        for domain in domains:
            for term in terms[domain]:
                try:
                    if term in goEntrezdict[entrezID][domain]:
                        vec.append(1)
                    else:
                        vec.append(0)
                except KeyError:
                    vec.append(0)
        #save to dictionary
        govectordict[entrezID] = array(vec[:])
    #iterate over the given pairs
    for pair in pairs:
        #build the feature vector for each pair
        line = list(govectordict[pair[0]]*govectordict[pair[1]])
        #yield this feature vector
        yield line

In [87]:
testpairs = itertools.combinations(list(EntrezIDs)[0:10],2)
fvectors = genGOfvector(list(testpairs),goEntrezdict,terms)

In [63]:
import pickle

In [93]:
f = open("testgen.pickle","wb")
pickle.dump(fvectors,f)
f.close()


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-93-429889b1b145> in <module>()
      1 f = open("testgen.pickle","wb")
----> 2 pickle.dump(fvectors,f)
      3 f.close()

/usr/lib/python2.7/pickle.pyc in dump(obj, file, protocol)
   1368 
   1369 def dump(obj, file, protocol=None):
-> 1370     Pickler(file, protocol).dump(obj)
   1371 
   1372 def dumps(obj, protocol=None):

/usr/lib/python2.7/pickle.pyc in dump(self, obj)
    222         if self.proto >= 2:
    223             self.write(PROTO + chr(self.proto))
--> 224         self.save(obj)
    225         self.write(STOP)
    226 

/usr/lib/python2.7/pickle.pyc in save(self, obj)
    304             reduce = getattr(obj, "__reduce_ex__", None)
    305             if reduce:
--> 306                 rv = reduce(self.proto)
    307             else:
    308                 reduce = getattr(obj, "__reduce__", None)

/usr/lib/python2.7/copy_reg.pyc in _reduce_ex(self, proto)
     68     else:
     69         if base is self.__class__:
---> 70             raise TypeError, "can't pickle %s objects" % base.__name__
     71         state = base(self)
     72     args = (self.__class__, base, state)

TypeError: can't pickle generator objects

This is a problem, and there are two ways around it. We could write a pseudo-generator class, instantiate it and pickle it. Or, since we really only want to generate a single feature vector each time, we could make it a plain function:


In [81]:
def genGOfvector(pair,goEntrezdict,terms):
    '''Create the Gene Ontology feature vector for a single pair of proteins.
    Takes as input arguments:
        * a pair (tuple) of protein Entrez IDs to map
        * dictionary mapping Entrez IDs to dictionaries of GO terms keyed by domain
        * dictionary of the terms we're interested in, keyed by domain
    Returns the corresponding feature vector'''
    # First define the domains, I'm guessing they're not changing
    domains = ['molecular_function','cellular_component','biological_process']
    # initialise the vectors we're going to be using:
    govectordict= {}
    for entrezID in pair:
        vec = []
        for domain in domains:
            for term in terms[domain]:
                try:
                    if term in goEntrezdict[entrezID][domain]:
                        vec.append(1)
                    else:
                        vec.append(0)
                except KeyError:
                    vec.append(0)
        #save to dictionary
        govectordict[entrezID] = array(vec[:])
    #build the feature vector for the pair by elementwise multiplication
    line = list(govectordict[pair[0]]*govectordict[pair[1]])
    #return this feature vector
    return line

In [61]:
testpairs = itertools.combinations(list(EntrezIDs)[0:10],2)
testpair = list(testpairs)[1]
fvector = genGOfvector(testpair,goEntrezdict,terms)

In [98]:
#then try to pickle this
f = open("testgen.pickle","wb")
pickle.dump(genGOfvector,f)
f.close()

Unfortunately, this function has no memory, so we'd have to pickle the goEntrezdict and the terms separately. It is easier to create an object with its own __getitem__ method, which can then be instantiated and pickled with the required data stored inside. This can then be handed straight into the parsing code; all that's required is to make sure a definition of the class is available.


In [88]:
import numpy as np

class GOfvectors():
    def __init__(self,goEntrezdict,terms):
        self.goEntrezdict = goEntrezdict
        self.terms = terms
        return None
    
    def __getitem__(self,key):
        pair = list(key)
        if len(pair)==1:
            pair = pair*2
        # First define the domains, I'm guessing they're not changing
        domains = ['molecular_function','cellular_component','biological_process']
        # initialise the vectors we're going to be using:
        govectordict= {}
        for entrezID in pair:
            vec = []
            for domain in domains:
                for term in self.terms[domain]:
                    try:
                        if term in self.goEntrezdict[entrezID][domain]:
                            vec.append(1)
                        else:
                            vec.append(0)
                    except KeyError:
                        vec.append(0)
            #save to dictionary
            govectordict[entrezID] = np.array(vec[:])
        #build the feature vector for the pair by elementwise multiplication
        line = list(govectordict[pair[0]]*govectordict[pair[1]])
        #return this feature vector
        return line

Then save the above to a file in ocbio; if rerunning this, make sure the cell number given to %save is correct:


In [95]:
cd ../opencast-bio/ocbio/


/home/gavin/Documents/MRes/opencast-bio/ocbio

In [97]:
%save -f geneontology.py 88


The following commands were written to file `geneontology.py`:
import numpy as np

class GOfvectors():
    def __init__(self,goEntrezdict,terms):
        self.goEntrezdict = goEntrezdict
        self.terms = terms
        return None
    
    def __getitem__(self,key):
        pair = list(key)
        if len(pair)==1:
            pair = pair*2
        # First define the domains, I'm guessing they're not changing
        domains = ['molecular_function','cellular_component','biological_process']
        # initialise the vectors we're going to be using:
        govectordict= {}
        for entrezID in pair:
            vec = []
            for domain in domains:
                for term in self.terms[domain]:
                    try:
                        if term in self.goEntrezdict[entrezID][domain]:
                            vec.append(1)
                        else:
                            vec.append(0)
                    except KeyError:
                        vec.append(0)
            #save to dictionary
            govectordict[entrezID] = np.array(vec[:])
        #build the feature vector for the pair by elementwise multiplication
        line = list(govectordict[pair[0]]*govectordict[pair[1]])
        #return this feature vector
        return line

In [101]:
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")

In [102]:
import ocbio.geneontology

In [103]:
test = ocbio.geneontology.GOfvectors(goEntrezdict,terms)

In [104]:
print test[testpair]


[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [98]:
cd ../../geneontology/


/home/gavin/Documents/MRes/geneontology

In [105]:
#pickle the instance with the data in it
f = open("testgen.pickle","wb")
pickle.dump(test,f)
f.close()
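
To confirm the round trip works, the pickled instance can be loaded back (assuming ocbio.geneontology is importable so the class definition is available) and queried again; a quick sketch:

f = open("testgen.pickle", "rb")
test2 = pickle.load(f)
f.close()
#should reproduce the feature vector printed above
print test2[testpair]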

In [106]:
print testpair


('114787', '114785')

Saving terms

Saving the terms chosen using the bait and prey datasets above in a human-readable format for reference:


In [110]:
f = open("geneontology.terms.txt", "w")
c = csv.writer(f,delimiter="\t")
for k in terms.keys():
    for t in terms[k]:
        c.writerow([k,t])
f.close()

In [111]:
!cat geneontology.terms.txt